1 Package Citation

Hai Fang, The ULTRA-DD Consortium, Julian C Knight. Pi: an R/Bioconductor package leveraging genetic evidence to prioritize drug targets at the gene and pathway level. Bioconductor (2018); doi:10.18129/B9.bioc.Pi

2 Package Installation

2.2 Packages

# stable release version from Bioconductor
source("http://bioconductor.org/biocLite.R")
biocLite("Pi")

# development version from github
source("http://bioconductor.org/biocLite.R")
biocLite("devtools")
devtools::install_github("hfang-bristol/Pi")

3 Bug Reports

We are grateful to have your feedbacks particularly bug reports. Please file issues here for any bugs or other technical matters.

4 Pipeline Overview

Pi pipeline. Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (on genome-wide).

Figure 1: Pi pipeline
Seed genes defined from input GWAS summary data using scores for genomic predictors to determine a gene (denoted by circle) being functionally linked to the input disease associated genetic variant (denoted by triangle) based on proximity, conformation and expression, each represented as different pie segments; scores for annotation predictors (immune function/phenotype/disease) then only applied to such seed genes. Knowledge of network connectivity defines non-seed genes, with an example showing how network connectivity with highly prioritised seed genes can identify a non-seed gene (TNF) in rheumatoid arthritis. Predictor matrix generates numerical Pi prioritisation rating (scored 0-5) and ranking (on genome-wide).

5 Core Functions

Pi functions. Designed in a nested way following this route: *[`xPierSNPsAdv`](http://rawgit.com/hfang-bristol/Pi/master/inst/xPierSNPs.html)* -> *[`xPierSNPs`](http://rawgit.com/hfang-bristol/Pi/master/inst/xPierSNPs.html)* -> *[`xPierGenes`](http://rawgit.com/hfang-bristol/Pi/master/inst/xPierGenes.html)* -> *[`xPier`](http://rawgit.com/hfang-bristol/Pi/master/inst/xPier.html)* -> *[`xRWR`](http://rawgit.com/hfang-bristol/Pi/master/inst/xRWR.html)* to prepare a gene-predict matrix and prioritisation of targe genes (`xPierMatrix`) taking as inputs disease-associated SNPs and their significance level (GWAS summary data) for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (*[`xPierPathways`](http://rawgit.com/hfang-bristol/Pi/master/inst/xPierPathways.html)*) or pathway crosstalk identification (*[`xPierSubnet`](http://rawgit.com/hfang-bristol/Pi/master/inst/xPierSubnet.html)*)

Figure 2: Pi functions
Designed in a nested way following this route: xPierSNPsAdv -> xPierSNPs -> xPierGenes -> xPier -> xRWR to prepare a gene-predict matrix and prioritisation of targe genes (xPierMatrix) taking as inputs disease-associated SNPs and their significance level (GWAS summary data) for specific traits. The output of this route is further taken as inputs for target pathway prioritisation (xPierPathways) or pathway crosstalk identification (xPierSubnet)

6 Getting Started

Using curated GWAS summary data for rheumatoid arthritis (RA) as inputs, this section gives a step-by-step demo showing how to use Pi to achieve disease-specific prioritisation of drug targets at the gene and pathway level

First of all, load the package and specify the location of built-in data.

library(Pi)
RData.location <- "http://pi.well.ox.ac.uk/bigdata"

6.1 Input data

GWAS summary data for RA are primarily sourced from GWAS Catalog and ImmunoBase.

data.file <- file.path(RData.location, "GWAS_summary_data.RA.txt")
data <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)

Table 1: The first 5 rows of data, with the column SNP for GWAS SNPs and the column pvalue for GWAS p-values
SNP pvalue
rs998731 7.0e-09
rs9979383 5.0e-10
rs9826828 8.6e-10
rs968567 1.8e-08
rs9653442 1.6e-18

6.2 Prioritisation of target genes

Parameters used

include.LD <- 'EUR'
LD.r2 <- 0.8
LD.customised <- NULL
significance.threshold <- 5e-8
score.cap <- 10
distance.max <- 20000
decay.kernel <- "constant"
decay.exponent <- 2
eQTL.customised <- NULL
GR.SNP <- "dbSNP_GWAS"
#GR.SNP <- xRDataLoader('dbSNP_GWAS', RData.location=RData.location) # equivalent to this
GR.Gene <- "UCSC_knownGene"
#GR.Gene <- xRDataLoader('UCSC_knownGene', RData.location=RData.location) # equivalent to this
include.TAD <- "GM12878" # lymphoblast, reflective of immune-context genomic organisation
include.eQTL <- c("JKscience_CD14","JKscience_LPS2","JKscience_LPS24","JKscience_IFN","JKng_bcell","JKpg_CD4","JKpg_CD8","JKnc_neutro", "JK_nk", "WESTRAng_blood")
eQTL.customised <- NULL
include.HiC <- c("Monocytes","Macrophages_M0","Macrophages_M1","Macrophages_M2","Neutrophils","Naive_CD4_T_cells","Total_CD4_T_cells","Naive_CD8_T_cells","Total_CD8_T_cells","Naive_B_cells","Total_B_cells")
cdf.function <- "empirical"
scoring.scheme <- 'max'
network <- "STRING_high"
STRING.only <- NA
weighted <- FALSE
network.customised <- NULL
#network.customised <- xDefineNet("STRING_high", RData.location=RData.location)  # equivalent to this
seeds.inclusive <- TRUE
normalise <- "laplacian"
restart <- 0.7
normalise.affinity.matrix <- "none"
parallel <- TRUE
multicores <- NULL
verbose <- TRUE

Table 2: Explanation of eQTL (eGene) datasets used
Code Name
JKscience_CD14 CD14+ (monocytes)
JKscience_LPS2 CD14+ by LPS2h
JKscience_LPS24 CD14+ by LPS24h
JKscience_IFN CD14+ by IFN24h
JKng_bcell B cells
JKpg_CD4 CD4+ T cells
JKpg_CD8 CD8+ T cells
JKnc_neutro neutrophils
JK_nk NK cells
WESTRAng_blood peripheral blood

Table 3: Explanation of promoter capture Hi-C (cGene) datasets used
Code Name
Monocytes monocytes
Macrophages_M0 macrophages (M0)
Macrophages_M1 macrophages (M1)
Macrophages_M2 macrophages (M2)
Neutrophils neutrophils
Naive_CD4_T_cells CD4+ T cells (naïve)
Total_CD4_T_cells CD4+ T cells (total)
Naive_CD8_T_cells CD8+ T cells (naïve)
Total_CD8_T_cells CD8+ T cells (total)
Naive_B_cells B cells (naïve)
Total_B_cells B cells (total)

Prioritisation

# prepare predictors
## first, genomic predictors
ls_pNode_genomic <- xPierSNPsAdv(data, include.LD=include.LD, LD.customised=LD.customised, LD.r2=LD.r2, significance.threshold=significance.threshold, score.cap=score.cap, distance.max=distance.max, decay.kernel=decay.kernel, decay.exponent=decay.exponent, GR.SNP=GR.SNP, GR.Gene=GR.Gene, include.TAD=include.TAD, include.eQTL=include.eQTL, eQTL.customised=eQTL.customised, include.HiC=include.HiC, cdf.function=cdf.function, scoring.scheme=scoring.scheme, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
## then, annotation predictors
data.file <- file.path(RData.location, "iAnno.txt")
iAnno <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE)
iA <- iAnno[, 1:14]
ls_pNode_anno <- lapply(9:11, function(j){
    data_anno <- data.frame(seed=iA$Symbol, weight=iA[,j], stringsAsFactors=F)
    data_anno <- subset(data_anno, weight>0)
    pNode <- xPierAnno(data_anno, list_pNode=ls_pNode_genomic, network=network, STRING.only=STRING.only, weighted=weighted, network.customised=network.customised, seeds.inclusive=seeds.inclusive, normalise=normalise, restart=restart, normalise.affinity.matrix=normalise.affinity.matrix, parallel=parallel, multicores=multicores, verbose=verbose, RData.location=RData.location)
})
names(ls_pNode_anno) <- paste("Annotation_", colnames(iA)[9:11], sep='')
## bring together both predictors
ls_pNode <- c(ls_pNode_anno, ls_pNode_genomic)

# predictor matrix
df_predictor <- xPierMatrix(ls_pNode, displayBy="score", combineBy="union", aggregateBy="none", RData.location=RData.location)

# Prioritisation in a discovery mode
dTarget <- xPierMatrix(ls_pNode, displayBy="pvalue", combineBy="union", aggregateBy="fishers", RData.location=RData.location)   

The results are stored in an object called dTarget, a list with components including a data frame priority.

dTarget
## An object of S3 class 'dTarget', with 3 components:
##   $metag: an igraph object with 15430 nodes and 313918 edges
##   $predictor: a data frame of 15430 rows X 25 columns
##   $priority: a data frame of 15430 rows X 13 columns
## 
## --------------------------------------------------
## $priority:
##      name rank       pvalue          fdr priority
##  HLA-DRB1    1 2.715127e-43 4.189441e-39 5.000000
##  HLA-DQA1    2 2.003228e-41 1.545490e-37 4.889061
##                                             description Overall OMIM
##   major histocompatibility complex, class II, DR beta 1       4    1
##  major histocompatibility complex, class II, DQ alpha 1       5    1
##  Phenotype Function nearbyGenes eQTL HiC
##          0        0           1    6   7
##          1        1           1    0   9
## ......

The data frame dTarget$priority can be written into a text file Pi_output_priority.txt:

write.table(dTarget$priority, file="Pi_output_priority.txt", sep="\t", row.names=FALSE)

Manhattan plot

gp_manhattan <- xPierManhattan(dTarget, color='ggplot2', top=30, top.label.type="text", top.label.size=3, top.label.col="black", y.scale="normal", y.lab="Pi rating", RData.location=RData.location)
gp_manhattan
Manhattan plot. Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named

Figure 3: Manhattan plot
Priority scores (Pi rating) for genes are displayed on the Y-axis and genomic locations on the X-axis, with top 30 genes named

6.3 Prioritisation of target pathways

Pathway-level prioritisation is based on enrichment analysis of top 1% prioritised genes (top 150 genes) using a compendium of pathways (sourced from KEGG and REACTOME).

6.3.1 KEGG pathways

priority.top <- 150
eTerm1 <- xPierPathways(dTarget, priority.top=priority.top, ontology="KEGGorganismal", size.range=c(10,5000), test="fisher", min.overlap=10, RData.location=RData.location)
eTerm2 <- xPierPathways(dTarget, priority.top=priority.top, ontology="KEGGenvironmental", size.range=c(10,5000), test="fisher", min.overlap=10, RData.location=RData.location)
list_eTerm <- list(eTerm1, eTerm2)
names(list_eTerm) <- c('Organismal systems','Environmental information processing')
gp_kegg <- xEnrichCompare(list_eTerm, displayBy="zscore", FDR.cutoff=0.01, facet=F, signature=F)
gp_kegg
Barplot of prioritised KEGG pathways

Figure 4: Barplot of prioritised KEGG pathways

6.3.2 REACTOME pathways

priority.top <- 150
eTerm <- xPierPathways(dTarget, priority.top=priority.top, ontology="REACTOME", size.range=c(10,5000), test="fisher", min.overlap=10, RData.location=RData.location)
gp_reactome <- xEnrichForest(eTerm, top_num=10, FDR.cutoff=0.01, CI.one=F, zlim=c(0,30), colormap="ggplot2.top", wrap.width=50, signature=F)
gp_reactome
Forest plot of prioritised REACTOME pathways

Figure 5: Forest plot of prioritised REACTOME pathways

6.4 Identification of pathway crosstalk

An algorithm is developed to search for a subset of a gene network (merged from KEGG pathways) in a way that the resulting gene subnetwork (or crosstalk between different pathways) contains highly prioritised genes with a few less prioritised genes as linkers.

# obtain a gene network merged from KEGG pathways
networks <- c("KEGG_environmental","KEGG_organismal")
ls_ig <- lapply(networks, function(network){
    g <- xDefineNet(network=network, weighted=FALSE, verbose=FALSE, RData.location=RData.location)
})
ig <- xCombineNet(ls_ig, combineBy='union', attrBy="intersect", verbose=TRUE)

# search for a subset (with a desired node number ~30) of the gene network 
g <- xPierSubnet(dTarget, network.customised=ig, priority.quantile=0.1, subnet.size=30, RData.location=RData.location)

# visualise pathway crosstalk
gp_rating_evidence <- xVisEvidenceAdv(dTarget, nodes=V(g)$name, g=g, colormap="jet.top")
gp_rating_evidence
Visualisation of pathway crosstalk. Nodes/genes colored by priority rating and labelled with evidence (seed genes; colored segments)

Figure 6: Visualisation of pathway crosstalk
Nodes/genes colored by priority rating and labelled with evidence (seed genes; colored segments)

7 Advanced Demo

7.1 Target clustering and visualisation

Analysis of the top 1% of prioritised gene lists for 16 immune traits identifies 912 target genes, which is used for target clustering and visualisation within a supra-hexagon map.

Input data

data.file <- file.path(RData.location, "Data_912genes_16traits.txt")
data <- read.delim(data.file, header=TRUE, stringsAsFactors=FALSE, row.names=1)

Table 4: The first 5 rows and the first 4 columns of data, with Pi rating inside
ALG AS ASM CRO
ABCF1 1.283385 0.000732473 0.7226988 0.00887349
ABL1 2.852690 1.854954662 2.3322260 2.20392799
ACTA2 1.891979 3.135782177 1.2497953 1.32313727
ACTL6A 1.766088 1.318861748 2.0790265 1.21876566
ACTR1A 1.962215 1.058672828 2.0097595 1.56059778

Clustering

library(supraHex)
sMap <- sPipeline(data=data, xdim=8, ydim=8)
sBase <- sDmatCluster(sMap)
output <- visDmatHeatmap(sMap, data, sBase, base.color="lightblue-darkorange", base.separated.arg=list(col="black",lty=1,lwd=1), base.legend.location="bottomleft", colormap="#7FFF7F-yellow-#FF7F00-red", zlim=c(0,5), KeyValueName="Pi rating", labRow=NA, srtCol=90, keep.data=T)

The resulting data frame output can be written into a text file Pi_output_clustering.txt:

write.table(output, file="Pi_output_clustering.txt", sep="\t", row.names=FALSE)

Visualisation

visDmatCluster(sMap, sBase, colormap="lightblue-darkorange")
Target clustering for top 1% prioritised genes across 16 traits (supra-hexagonal map) defines 7 target clusters (colored), each continuous over the map.

Figure 7: Target clustering for top 1% prioritised genes across 16 traits (supra-hexagonal map) defines 7 target clusters (colored), each continuous over the map

visHexBarplot(sMap, colormap="yellow-cyan")
A supra-hexagonal map labelled with the built-in index and the number of target genes per cell/hexagon.

Figure 8: A supra-hexagonal map labelled with the built-in index and the number of target genes per cell/hexagon

visHexMulComp(sMap, rect.grid=c(4,4), colormap="#7FFF7F-yellow-#FF7F00-red", zlim=c(0,5))
Average rating per cell for individual traits, illustrating trait-specific prioritisation profiles.

Figure 9: Average rating per cell for individual traits, illustrating trait-specific prioritisation profiles

8 Session Info

Here is the output of sessionInfo() on the system on which this user manual was compiled:

## R version 3.4.3 (2017-11-30)
## Platform: x86_64-apple-darwin15.6.0 (64-bit)
## Running under: OS X El Capitan 10.11.6
## 
## Matrix products: default
## BLAS: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRblas.0.dylib
## LAPACK: /Library/Frameworks/R.framework/Versions/3.4/Resources/lib/libRlapack.dylib
## 
## locale:
## [1] en_GB.UTF-8/en_GB.UTF-8/en_GB.UTF-8/C/en_GB.UTF-8/en_GB.UTF-8
## 
## attached base packages:
## [1] grid      stats     graphics  grDevices utils     datasets  methods  
## [8] base     
## 
## other attached packages:
##  [1] bindrcpp_0.2    png_0.1-7       BiocStyle_2.7.8 Pi_1.7.3       
##  [5] XGR_1.1.4       ggplot2_2.2.1   dnet_1.1.4      supraHex_1.13.3
##  [9] hexbin_1.27.1   igraph_1.1.2   
## 
## loaded via a namespace (and not attached):
##   [1] tidyselect_0.2.3              RSQLite_2.0                  
##   [3] AnnotationDbi_1.40.0          htmlwidgets_0.9              
##   [5] BiocParallel_1.12.0           munsell_0.4.3                
##   [7] units_0.4-6                   codetools_0.2-15             
##   [9] misc3d_0.8-4                  withr_2.1.0                  
##  [11] colorspace_1.3-2              BiocInstaller_1.28.0         
##  [13] Biobase_2.38.0                OrganismDbi_1.20.0           
##  [15] highr_0.6                     knitr_1.20                   
##  [17] rstudioapi_0.7                stats4_3.4.3                 
##  [19] ROCR_1.0-7                    robustbase_0.92-8            
##  [21] dimRed_0.1.0                  labeling_0.3                 
##  [23] GenomeInfoDbData_0.99.1       mnormt_1.5-5                 
##  [25] bit64_0.9-7                   rprojroot_1.2                
##  [27] xfun_0.1                      ipred_0.9-6                  
##  [29] biovizBase_1.26.0             randomForest_4.6-12          
##  [31] R6_2.2.2                      doParallel_1.0.11            
##  [33] GenomeInfoDb_1.14.0           AnnotationFilter_1.2.0       
##  [35] DRR_0.0.2                     bitops_1.0-6                 
##  [37] reshape_0.8.7                 DelayedArray_0.4.1           
##  [39] assertthat_0.2.0              scales_0.5.0                 
##  [41] nnet_7.3-12                   gtable_0.2.0                 
##  [43] ddalpha_1.3.1                 ggbio_1.26.1                 
##  [45] ensembldb_2.2.0               timeDate_3042.101            
##  [47] rlang_0.1.6                   CVST_0.2-1                   
##  [49] RcppRoll_0.2.2                splines_3.4.3                
##  [51] rtracklayer_1.38.1            lazyeval_0.2.1               
##  [53] ModelMetrics_1.1.0            acepack_1.4.1                
##  [55] dichromat_2.0-0               broom_0.4.3                  
##  [57] checkmate_1.8.5               intergraph_2.0-2             
##  [59] yaml_2.1.18                   reshape2_1.4.2               
##  [61] GenomicFeatures_1.30.0        ggnetwork_0.5.1              
##  [63] backports_1.1.1               httpuv_1.3.5                 
##  [65] Hmisc_4.0-3                   RMySQL_0.10.13               
##  [67] RBGL_1.54.0                   caret_6.0-77                 
##  [69] tools_3.4.3                   lava_1.5.1                   
##  [71] bookdown_0.7                  psych_1.7.8                  
##  [73] statnet.common_4.0.0          gplots_3.0.1                 
##  [75] RColorBrewer_1.1-2            BiocGenerics_0.24.0          
##  [77] Rcpp_0.12.15                  plyr_1.8.4                   
##  [79] base64enc_0.1-3               progress_1.1.2               
##  [81] zlibbioc_1.24.0               purrr_0.2.4                  
##  [83] RCurl_1.95-4.8                prettyunits_1.0.2            
##  [85] deldir_0.1-14                 rpart_4.1-11                 
##  [87] S4Vectors_0.16.0              sfsmisc_1.1-1                
##  [89] SummarizedExperiment_1.8.0    ggrepel_0.7.0                
##  [91] cluster_2.0.6                 magrittr_1.5                 
##  [93] data.table_1.10.4-3           sna_2.4                      
##  [95] ProtGenerics_1.10.0           matrixStats_0.52.2           
##  [97] evaluate_0.10.1               mime_0.5                     
##  [99] xtable_1.8-2                  XML_3.98-1.9                 
## [101] IRanges_2.12.0                gridExtra_2.3                
## [103] compiler_3.4.3                biomaRt_2.34.0               
## [105] tibble_1.4.2                  KernSmooth_2.23-15           
## [107] htmltools_0.3.6               Formula_1.2-2                
## [109] udunits2_0.13                 tidyr_0.7.2                  
## [111] lubridate_1.7.1               DBI_0.7                      
## [113] tweenr_0.1.5                  MASS_7.3-47                  
## [115] Matrix_1.2-12                 gdata_2.18.0                 
## [117] parallel_3.4.3                Gviz_1.22.1                  
## [119] bindr_0.1                     gower_0.1.2                  
## [121] GenomicRanges_1.30.0          pkgconfig_2.0.1              
## [123] GenomicAlignments_1.14.1      RCircos_1.2.0                
## [125] foreign_0.8-69                recipes_0.1.1                
## [127] foreach_1.4.3                 XVector_0.18.0               
## [129] prodlim_1.6.1                 stringr_1.3.0                
## [131] VariantAnnotation_1.24.2      digest_0.6.12                
## [133] graph_1.56.0                  Biostrings_2.46.0            
## [135] rmarkdown_1.9                 htmlTable_1.11.0             
## [137] curl_3.0                      kernlab_0.9-25               
## [139] shiny_1.0.5                   Rsamtools_1.30.0             
## [141] gtools_3.5.0                  nlme_3.1-131                 
## [143] network_1.13.0                BSgenome_1.46.0              
## [145] pillar_1.1.0                  lattice_0.20-35              
## [147] GGally_1.3.2                  httr_1.3.1                   
## [149] DEoptimR_1.0-8                survival_2.41-3              
## [151] interactiveDisplayBase_1.16.0 glue_1.2.0                   
## [153] iterators_1.0.8               plot3D_1.1.1                 
## [155] glmnet_2.0-13                 bit_1.1-12                   
## [157] Rgraphviz_2.22.0              ggforce_0.1.1                
## [159] class_7.3-14                  stringi_1.1.7                
## [161] blob_1.1.0                    AnnotationHub_2.10.1         
## [163] latticeExtra_0.6-28           caTools_1.17.1               
## [165] memoise_1.1.0                 dplyr_0.7.4                  
## [167] ape_5.0